ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
Sun, Yongqian, Pan, Xijie, Xiong, Xiao, Tao, Lei, Wang, Jiaju, Zhang, Shenglin, Yuan, Yuan, Li, Yuqi, Jian, Kunlin
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios due to data heterogeneity and insufficient accuracy. This paper proposes a novel framework, ClusterRCA, that localizes culprit nodes and determines failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches: a failure graph is constructed from the output of a state classifier, and ClusterRCA then performs a customized random walk on the graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show that ClusterRCA achieves high accuracy in diagnosing network failures in HPC systems, and it maintains robust performance across different application scenarios.
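The abstract specifies only that a failure graph is built from classifier output and that a customized random walk ranks candidate culprit nodes; the exact transition weights are not given here. The sketch below is a minimal, generic version of that idea, with hypothetical node names and anomaly scores standing in for the state classifier's output:

```python
import random
from collections import defaultdict

def random_walk_rca(graph, anomaly, steps=10000, restart=0.15, seed=0):
    """Rank candidate culprit nodes by a random walk over a failure graph.

    graph:   dict node -> list of neighbor nodes (NIC-pair connectivity)
    anomaly: dict node -> anomaly score in [0, 1] from a state classifier
    The walker moves to a neighbor with probability proportional to that
    neighbor's anomaly score; with probability `restart` it teleports to a
    random node so it cannot get trapped in one region of the graph.
    """
    rng = random.Random(seed)
    nodes = list(graph)
    visits = defaultdict(int)
    current = rng.choice(nodes)
    for _ in range(steps):
        visits[current] += 1
        neighbors = graph[current]
        if not neighbors or rng.random() < restart:
            current = rng.choice(nodes)
            continue
        weights = [anomaly.get(n, 0.0) + 1e-6 for n in neighbors]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
    total = sum(visits.values())
    # Nodes visited most often are the most likely root causes.
    return sorted(((n, visits[n] / total) for n in nodes),
                  key=lambda kv: -kv[1])
```

The first entry of the returned ranking is the walk's best guess at the culprit node; a real system would also attach the classifier's failure-type label to it.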
HPC Application Parameter Autotuning on Edge Devices: A Bandit Learning Approach
Hossain, Abrar, Badawy, Abdel-Hameed A., Islam, Mohammad A., Patki, Tapasya, Ahmed, Kishwar
The growing necessity for enhanced processing capabilities in edge devices with limited resources has led us to develop effective methods for improving high-performance computing (HPC) applications. In this paper, we introduce LASP (Lightweight Autotuning of Scientific Application Parameters), a novel strategy designed to address the parameter search space challenge in edge devices. Our strategy employs a multi-armed bandit (MAB) technique focused on online exploration and exploitation. Notably, LASP takes a dynamic approach, adapting seamlessly to changing environments. We tested LASP with four HPC applications: Lulesh, Kripke, Clomp, and Hypre. Its lightweight nature makes it particularly well-suited for resource-constrained edge devices. By employing the MAB framework to efficiently navigate the search space, we achieved significant performance improvements while adhering to the stringent computational limits of edge devices. Our experimental results demonstrate the effectiveness of LASP in optimizing parameter search on edge devices.
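LASP's exact bandit formulation is not reproduced in this summary; as an illustration of the multi-armed bandit idea it builds on, the following is a minimal epsilon-greedy tuner over a hypothetical set of parameter configurations, where the reward for an arm would in practice come from measuring the application (e.g. negative runtime):

```python
import random

class EpsilonGreedyTuner:
    """Online selection among candidate parameter configurations (arms)."""

    def __init__(self, configs, epsilon=0.1, seed=0):
        self.configs = list(configs)
        self.epsilon = epsilon          # fraction of rounds spent exploring
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.configs)
        self.totals = [0.0] * len(self.configs)

    def select(self):
        """Return the index of the arm to run next."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.configs))  # explore
        # Exploit: highest mean reward so far; untried arms come first.
        means = [t / c if c else float("inf")
                 for t, c in zip(self.totals, self.counts)]
        return max(range(len(self.configs)), key=means.__getitem__)

    def update(self, arm, reward):
        """Record the observed reward for one run of `arm`."""
        self.counts[arm] += 1
        self.totals[arm] += reward
```

Each round the tuner proposes a configuration, the application is run with it, and the measured reward is fed back; over time the sampling concentrates on the best-performing configuration while still occasionally exploring.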
GPU Dedicated Servers with RTX 3090, A100 80GB, RTX A6000
The NVIDIA A100 80GB Ampere GPU premieres the world's fastest memory bandwidth, at over 2 terabytes per second, to run the largest simulation models and datasets. It allows researchers to quickly deliver accurate results and deploy solutions into production at scale. NVIDIA A100 Tensor Cores with Tensor Float 32 (TF32) provide up to 20x higher performance over the NVIDIA Volta with zero code changes, and an additional 2x boost with automatic mixed precision and FP16. For the largest models with enormous data tables, such as deep learning recommendation models (DLRM), the Ampere A100 80GB GPU reaches 1.3 TB of unified memory per node and delivers up to a 3x throughput increase over the A100 40GB GPU. It has also set multiple performance records in MLPerf, the industry-wide benchmark for AI training.
HPE and Cerebras build new AI supercomputer at LRZ in Munich
HPE and Cerebras Systems have built a new AI supercomputer in Munich, Germany, pairing an HPE Superdome Flex with the AI accelerator technology from Cerebras for use by the scientific and engineering community. The new system, created for the Leibniz Supercomputing Center (LRZ) in Munich, is being deployed to meet the current and expected future compute needs of researchers, including larger deep learning neural network models and the emergence of multi-modal problems that involve multiple data types such as images and speech, according to Laura Schulz, LRZ's head of Strategic Developments and Partnerships. "We're seeing an increase in large data volumes coming at us that need more and more processing, and models that are taking months to train, we want to be able to speed that up," Schulz said. "And then we're also seeing multi-modal problems, such as integration of natural language processing (NLP) and medical imaging or documents, so we have this complexity, we have this need for faster, we have this need for bigger that's coming from our user side, from our facility side, and we need to make sure that we're constantly evaluating to have these different novel architectures, to have different usage models to be able to understand all that." The LRZ team decided that the Cerebras technology, with its large shared memory and scalability, was a good match for the "pain points" they were trying to resolve, she said.
Python's increasing popularity in scientific and high-performance computing
Python is an experiment in how much freedom programmers need. Too much freedom and nobody can read another's code; too little and expressiveness is endangered. Last year, Python was named the most popular programming language. The language's growing popularity can be attributed to the rise of data science and the machine learning ecosystem and corresponding software libraries like Pandas, TensorFlow, PyTorch, and NumPy, among others. The fact that it is so easy to learn also helps Python gain favour in the programming community.
Scientific Machine Learning and HPC-AI Technology Convergence - insideHPC
Some of the most well-known examples of the use of machine learning techniques in science applications are the detection and classification of gravitational-wave signals from LIGO and Virgo in astrophysics [1], the recent DeepMind AlphaFold2 capabilities outperforming classical methods in protein folding [2], and the 2020 Gordon Bell Prize-winning Deep Potential Molecular Dynamics work [3], which is opening new breakthroughs in the drug design process and could speed up future pandemic response efforts. Beyond these key examples, the convergence between HPC and AI is natural: DL-based surrogate modelling is increasingly widely applied in research, and recent advances in physics-informed neural networks such as HNN [4] bring physical properties and constraints into neural network loss functions, opening a promising path towards a new generation of simulation. At Atos, we have built a dedicated approach to support the scientific community and industry by bringing data science and HPC expertise through the Atos Centers of Excellence. Each center is oriented towards a specific domain where our experts and our customers can jointly develop innovations and technologies with the support of some of our partners. Some of the first Atos Centers of Excellence are dedicated to weather forecasting & climate change [5] and life sciences [6].
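The physics-informed idea mentioned above, adding physical constraints to the loss function, can be illustrated on a toy problem. The sketch below is a generic PINN-style loss (not the HNN method cited as [4]), assuming the test equation u' = -u with u(0) = 1 and a hypothetical exponential ansatz u(t) = a·exp(b·t):

```python
import math

def physics_informed_loss(a, b, ts=None):
    """Loss for a candidate solution u(t) = a*exp(b*t) of u' = -u, u(0) = 1.

    data term:    squared error against the known initial condition;
    physics term: mean squared ODE residual u'(t) + u(t) on collocation points.
    """
    if ts is None:
        ts = [i / 10 for i in range(11)]      # collocation grid on [0, 1]
    u = lambda t: a * math.exp(b * t)
    du = lambda t: a * b * math.exp(b * t)    # exact derivative of the ansatz
    data = (u(0.0) - 1.0) ** 2
    physics = sum((du(t) + u(t)) ** 2 for t in ts) / len(ts)
    return data + physics
```

The exact solution (a = 1, b = -1) zeroes both terms, so any optimizer driving this loss down recovers the physics; in a real PINN the ansatz is a neural network and the residual is computed by automatic differentiation.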
Supermicro Accelerates AI and Deep Learning with NGC-Ready Servers - insideHPC
Today Supermicro announced the industry's broadest portfolio of validated NGC-Ready systems optimized to accelerate AI and deep learning applications. Supermicro is highlighting many of these systems today at the Supermicro GPU Live Forum in conjunction with NVIDIA GTC Digital. Supermicro NGC-Ready systems allow customers to train AI models using NVIDIA V100 Tensor Core GPUs and to perform inference using NVIDIA T4 Tensor Core GPUs. NGC hosts GPU-optimized software containers for deep learning, machine learning and HPC applications, pre-trained models, and SDKs that can run anywhere the Supermicro NGC-Ready systems are deployed, whether in data centers, cloud, edge micro-datacenters, or in distributed remote locations as environment-resilient and secured NVIDIA-Ready for Edge servers powered by the NVIDIA EGX intelligent edge platform. "With over 26 years of experience delivering state-of-the-art computing solutions, Supermicro systems are the most power-efficient, the highest performing, and the best value," said Charles Liang, CEO and president of Supermicro. "With support for fast networking and storage, as well as NVIDIA GPUs, our Supermicro NGC-Ready systems are the most scalable and reliable servers to support AI. Customers can run their AI infrastructure with the highest ROI." Supermicro currently leads the industry with the broadest portfolio of NGC-Ready Servers optimized for data center and cloud deployments and is continuing to expand its portfolio. In addition, the company offers five validated NGC-Ready for Edge servers (EGX) optimized for edge inferencing applications. "NVIDIA's container registry, NGC, enables superior performance for deep learning frameworks and pre-trained AI models with state-of-the-art accuracy," said Ian Buck, vice president and general manager of Accelerated Computing at NVIDIA.
Video: What Can HPC on AWS Do? - insideHPC
In this video from the HPC User Forum at Argonne, Ian Colle from Amazon presents: What Can HPC on AWS Do? "AWS provides the most elastic and scalable cloud infrastructure to run your HPC applications. With virtually unlimited capacity, engineers, researchers, and HPC system owners can innovate beyond the limitations of on-premises HPC infrastructure. AWS delivers an integrated suite of services that provides everything needed to quickly and easily build and manage HPC clusters in the cloud to run the most compute-intensive workloads across various industry verticals. These workloads span the traditional HPC applications, like genomics, computational chemistry, financial risk modeling, computer-aided engineering, weather prediction, and seismic imaging, as well as emerging applications, like machine learning, deep learning, and autonomous driving." Ian Colle joined AWS as the General Manager for AWS Batch and HPC in November 2017.
Seamlessly scaling HPC and AI initiatives with HPE leading-edge technology
Accelerate your HPC and AI workloads with new products, advanced technologies, and services from HPE. A growing number of commercial businesses are implementing HPC solutions to derive actionable business insights, to run higher-performance applications, and to gain a competitive advantage. In fact, according to Hyperion Research, the HPC market exceeded expectations with 6.8% growth in 2018, with continued growth expected through 2023.[1] Complexities abound as HPC becomes more pervasive across industries and markets, especially as you adopt, scale, and optimize HPC and AI workloads. HPE is in lockstep with you along your AI journey. We help you get started with your AI transformation and scale more quickly, saving time and resources.
Will The Harmonic Convergence Of HPC And AI Last?
History and economics – as if you could separate the two – are replete with examples of products being developed for one task and then being used, perhaps after some tweaking, for an entirely new and usually unexpected task. History is also full of stories of technologies aimed squarely at a task that, for one reason or another, miss the mark even if it looks like they were right on target. Product substitution as a means of lowering costs, and thereby making a technology more prevalent, is one of the primary reasons that economies exist. Some people make money in the transformation, and others lose out, but the overall economy improves from the efficiency engendered in that change. So it is a net good, and if done right, there is some money left over to invest in something else entirely. Every once in a while, you get a product substitution working from two different angles, and you can get a whole bunch of different things converging on a technology.